docs: add debugging error lookup table #1884

Merged
heyong4725 merged 2 commits into dora-rs:main from GHX5T-SOL:docs-1878-error-table
May 26, 2026

Conversation

@GHX5T-SOL
Contributor

Addresses the common-error-message table slice of #1878.

What changed:

  • Added a Common Error Messages section to docs/debugging.md.
  • Linked it from the existing table of contents.
  • Mapped recurring daemon/node error fragments to likely causes and next diagnostic commands.

Validation:

  • git diff origin/main..HEAD --check
  • git show --format= --patch HEAD | gitleaks stdin --no-banner --redact --timeout 30 -> no leaks found

Not run: cargo tests; docs-only change.

@trunk-io
Contributor

trunk-io Bot commented May 20, 2026

😎 Merged manually by @heyong4725 - details.

Collaborator

@phil-opp phil-opp left a comment

Thanks!

@github-actions
Contributor

@GHX5T-SOL the Trunk merge queue failed for this PR.

See the Trunk merge-status comment for details.

Posted as a new comment so GitHub sends an email — Trunk's sticky comment is edited in place and won't trigger a notification.

@GHX5T-SOL GHX5T-SOL force-pushed the docs-1878-error-table branch from 600d842 to 9cf6f93 Compare May 20, 2026 13:27
@GHX5T-SOL
Contributor Author

Rebased this branch on current origin/main after the merge queue failed on the pip-release all green check. The diff is still docs-only.

Validation:

  • git diff --check origin/main...HEAD
  • git diff origin/main...HEAD | gitleaks detect --pipe --redact --no-banner -> no leaks found

@GHX5T-SOL
Contributor Author

/trunk merge

@trunk-io
Contributor

trunk-io Bot commented May 20, 2026

An error occurred while submitting your PR to the queue: Only users that are a part of this repo's Trunk organization or have write permissions to the repo can submit a PR to the queue

@heyong4725
Collaborator

Did a verification pass on the table. Every error message fragment was sourced from a real codebase string — I grep-checked all 13 and they all match. The coordinator heartbeat timeout (20s) literal is preserved verbatim from binaries/daemon/src/lib.rs:993. Nice sourcing.

Two notes worth flagging:

1. dora logs --daemon is not a real flag

In the "failed to register node with dora-daemon" row, the Fix column says Check dora logs --daemon. That flag doesn't exist. target/debug/dora logs --help shows the actual surface:

--all-nodes, --tail, --local, --since, --until, --level,
--log-format, --log-filter, --grep, --coordinator-addr, --coordinator-port

If the intent was "read the daemon's own logs," there isn't a first-class path for that. The closest options are dora logs --local (which reads the out/<dataflow>/log_*.jsonl files, including daemon-side log entries) or grepping the stderr of the coordinator/daemon processes spawned by dora up. Suggest reframing as something like:

Inspect the daemon's stderr (the process started by dora up); confirm the node ID and dataflow ID match the current run.
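
As a rough illustration of the "grep the local logs" fallback mentioned above, here is a minimal Python sketch that scans the log_*.jsonl files in one dataflow's output directory for an error fragment. It is not part of dora: the function name find_fragment is hypothetical, the directory layout is taken only from the out/<dataflow>/log_*.jsonl pattern quoted in this comment, and each line is treated as raw text because the JSON schema of the log entries is not described here.

```python
from pathlib import Path


def find_fragment(dataflow_dir, fragment):
    """Return (file name, line number, line) for every log line containing `fragment`.

    Assumption: `dataflow_dir` is a directory such as out/<dataflow> that holds
    log_*.jsonl files. Lines are matched as plain text, not parsed as JSON.
    """
    hits = []
    for log_file in sorted(Path(dataflow_dir).glob("log_*.jsonl")):
        with log_file.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if fragment in line:
                    hits.append((log_file.name, number, line.rstrip("\n")))
    return hits
```

Calling find_fragment with the path of one out/<dataflow> directory and a fragment such as "failed to register node with dora-daemon" would list every matching entry together with the file it came from.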

2. Positional dora logs <dataflow> <node> will conflict with #1883

Three of your Fix cells use dora logs <dataflow> <node> (the old positional shape). Your other PR (#1883) deprecates that in favor of dora logs <dataflow> --node <node>. If #1883 lands first, this doc is instantly stale on those rows.

Two options depending on merge order:

Either way, the rows affected are: "node ... not connected", "node ... channel closed", and possibly "node ... channel full" (it inspects logs implicitly).


Aside from those, the table is a solid addition. Verified dora node info -d, dora param list/set, DORA_DAEMON_LOCAL_LISTEN_PORT, and dora down/up all behave as the doc describes.

@GHX5T-SOL
Contributor Author

Addressed the debugging-table/doc syntax notes in 917b4f2.

Changes:

  • Replaced the nonexistent dora logs --daemon guidance with the suggested daemon-stderr direction for the process started by dora up.
  • Updated stale positional node examples in docs/debugging.md to use dora logs <dataflow> --node <node>.
  • Updated the local-log example to dora logs --local --node <node>.

Validation:

  • grep confirmed no remaining dora logs --daemon references in docs/debugging.md
  • grep confirmed no remaining stale dora logs <dataflow> <node> examples in docs/debugging.md; the only remaining match is the valid dataflow-only dora logs <uuid> form
  • git diff --check
  • changed-doc diff gitleaks scan -> no leaks found
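
The grep-based validation described in this comment could be approximated with a small script. The sketch below is an assumption about how such a check might look, not the commands that were actually run: the regex and the function name stale_positional_examples are hypothetical. It flags lines where dora logs is followed by two consecutive non-flag tokens and skips both the --node form and the dataflow-only form.

```python
import re

# Two consecutive tokens after "dora logs" that do not start with "-" and
# contain no backtick. Assumption: this is a good-enough proxy for the old
# positional `dora logs <dataflow> <node>` shape in documentation text.
POSITIONAL = re.compile(r"dora logs\s+[^\s`-][^\s`]*\s+[^\s`-][^\s`]*")


def stale_positional_examples(text):
    """Return (line number, line) pairs that still use the positional node form."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if POSITIONAL.search(line)
    ]
```

The check does not catch forms where a flag comes first, such as dora logs --local <node>, so a manual spot-check is still needed.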

@heyong4725
Collaborator

Nice work on the error table. I verified that all 13 error fragments exist in the codebase, and the "likely cause" and "next step" columns are accurate and actionable.

One issue to fix before this can land

The 7 dora logs --node <name> insertions are invalid CLI syntax. dora logs takes the node name as a positional argument:

dora logs <dataflow> <node>

See binaries/cli/src/command/logs.rs:34 — the node: field has #[clap(value_name = "NAME")] but no #[clap(long)], so it's positional only:

/// Show logs for the given node (omit with --all-nodes)
#[clap(value_name = "NAME")]
pub node: Option<NodeId>,

And dora logs --help confirms:

Usage: dora logs [OPTIONS] [UUID_OR_NAME] [NAME]

A user following this PR's examples will hit error: unexpected argument '--node' found.
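
The positional-versus-flag distinction behind this note can be shown with an analogy in Python's argparse. This is only a sketch: argparse stands in for clap, the parser names are hypothetical, and the error wording differs from clap's "unexpected argument" message. The first parser mirrors the positional-only shape quoted from logs.rs; the second adds a --node/-n option of the kind the reviewer says is missing.

```python
import argparse


def build_positional_parser():
    """Both the dataflow and the node are positional, like the quoted clap field."""
    parser = argparse.ArgumentParser(prog="logs-sketch")
    parser.add_argument("dataflow", nargs="?")
    parser.add_argument("node", nargs="?")
    return parser


def build_flag_parser():
    """Hypothetical variant in which the node is selected with --node / -n."""
    parser = argparse.ArgumentParser(prog="logs-sketch")
    parser.add_argument("dataflow", nargs="?")
    parser.add_argument("--node", "-n")
    return parser


argv = ["my-dataflow", "--node", "sensor-node"]

# parse_known_args collects anything the parser does not recognise instead of
# exiting, so the rejected flag can be inspected.
_, leftover = build_positional_parser().parse_known_args(argv)
accepted = build_flag_parser().parse_args(argv)
```

With the positional-only parser, --node ends up in leftover (a strict parse_args call would exit with an error), while the flag parser accepts the same command line.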

7 lines to revert

| Line | Current (wrong) | Should be |
| --- | --- | --- |
| 86 | `dora logs my-dataflow --node problem-node --follow --level debug` | `dora logs my-dataflow problem-node --follow --level debug` |
| 129 (table) | inspect `dora logs <dataflow> --node <node>` | inspect `dora logs <dataflow> <node>` |
| 131 (table) | Use `dora logs <dataflow> --node <node>` | Use `dora logs <dataflow> <node>` |
| 568 | `dora logs my-dataflow --node sensor-node --follow` | `dora logs my-dataflow sensor-node --follow` |
| 574 | `dora logs my-dataflow --node sensor-node --follow --level debug` | `dora logs my-dataflow sensor-node --follow --level debug` |
| 601 | `dora logs --local --node sensor-node --tail 50` | `dora logs --local sensor-node --tail 50` |
| 689 | `dora logs my-dataflow --node problem-node --follow --level trace` | `dora logs my-dataflow problem-node --follow --level trace` |

The sister doc docs/cli.md:546 already shows the canonical dora logs [UUID_OR_NAME] [NODE] [OPTIONS] and the example at docs/cli.md:1656 (correctly left untouched by this PR) uses the positional form. Aligning with that keeps the two docs consistent.

What's solid (don't change)

The error table itself (lines 117-138) is the meat of the PR and it's well-done:

| Error fragment (verified) | Source |
| --- | --- |
| Could not connect to the daemon | apis/rust/node/src/error.rs |
| failed to request node config from daemon | apis/rust/node/src/node/mod.rs |
| failed to register node with dora-daemon | apis/rust/node/src/daemon_connection/mod.rs |
| no running dataflow with ID | binaries/coordinator/src/lib.rs |
| channel full, channel closed | daemon + node API |
| failed to serialize param value for node | binaries/coordinator/src/lib.rs |
| coordinator heartbeat timeout | binaries/daemon/src/lib.rs |
| there is already a running dataflow with ID | binaries/coordinator/src/lib.rs |
| failed to infer JSON schema | apis/rust/node/src/daemon_connection/json_to_arrow.rs |
| Arrow IPC stream contained no record batches | apis/rust/node/src/node/arrow_utils.rs |
| zenoh publish failed | apis/rust/node/src/node/mod.rs |

Each "likely cause" reads accurate and each "next step" is a runnable command. This is the right level of detail for the lookup table #1878 asked for.

Also worth checking

Branch is 22 commits behind main (last sync 2026-05-20 at 271a0470). A git pull --rebase origin main before merging would catch any unexpected conflicts and put your work on top of current main.

TL;DR

  • Revert the 7 --node insertions to positional (1-line sed job: sed -i '' 's| --node \([a-zA-Z0-9_-]*\)| \1|g' docs/debugging.md, then spot-check)
  • Rebase on main to pick up the 6 days of work that landed since
  • Then this is good to land
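
To see what the sed expression in the first bullet would do before running it in place, the same substitution can be previewed in Python. This is a sketch for illustration only: to_positional is a hypothetical helper, and the sample lines are copied from the table of lines to revert.

```python
import re

# Same pattern as the sed one-liner: match " --node " plus an optional name
# made of letters, digits, "_" or "-", and keep only the name.
NODE_FLAG = re.compile(r" --node ([a-zA-Z0-9_-]*)")


def to_positional(line):
    """Rewrite 'dora logs X --node NAME' to 'dora logs X NAME'."""
    return NODE_FLAG.sub(r" \1", line)


for sample in (
    "dora logs my-dataflow --node sensor-node --follow",
    "dora logs --local --node sensor-node --tail 50",
    "dora logs <dataflow> --node <node>",
):
    print(to_positional(sample))
```

The placeholder case works because the name group may match nothing, so only the flag and one space are removed. Note that the -i '' spelling in the one-liner is the BSD/macOS form; GNU sed takes -i without the empty argument.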

@heyong4725
Collaborator

Correction to my earlier review — I missed PR #1883.

PR #1883 (also from @GHX5T-SOL, addressing issue #1880) is the upstream CLI change that adds the --node/-n flag to dora logs. With that PR landed, the --node syntax this PR introduces in docs/debugging.md is exactly correct.

My earlier P1 was wrong. Sorry for the noise — I should have looked at the contributor's other open PRs before flagging the syntax.

Updated recommendation

| | Before #1883 lands | After #1883 lands |
| --- | --- | --- |
| 5 command examples with dora logs ... --node X | Invalid syntax | Correct |
| 2 table rows referencing dora logs <dataflow> --node <node> | Invalid syntax | Correct |
| Error table (lines 117-138, the main contribution) | ✅ Solid | ✅ Solid |

Suggested merge sequence

  1. #1883 (feat(cli): use --node for dora logs node selection) lands first. It ships the --node CLI flag and updates docs/debugging.md in the same 5 places this PR does (plus README, docs/cli, docs/logging, docs/distributed-deployment, docs/quickstart).
  2. Rebase this PR on main. After #1883 lands, the 5 syntax-update lines in this PR will be duplicates of what is already on main, and git pull --rebase origin main will collapse the diff to just the new error table content and the TOC link.
  3. This PR merges cleanly with a focused diff: ~25 lines of new error-table content in docs/debugging.md.

If you'd rather land #1884 first, you'd need to coordinate with #1883's docs/debugging.md diff manually — but the rebase path is much simpler.

What still stands

@heyong4725 heyong4725 force-pushed the docs-1878-error-table branch from 917b4f2 to 31bf100 Compare May 26, 2026 17:40
@heyong4725
Collaborator

Rebased on main for you (PR opted into maintainer edits, so I pushed directly). New head: 31bf10000.

Now that #1883 has landed (f8aebe18), the --node syntax this PR uses in docs/debugging.md is correct, and the 5 syntax-update lines that were originally in this diff are already on main — so the rebase collapsed them out automatically.

Before rebase: 1 file changed, 31+/5- (TOC link + error table + 5 syntax adjustments)
After rebase: 1 file changed, 26+/0- (TOC link + error table only)

The error table content is unchanged — all 13 verified error fragments still there.

CI will rerun on the new head; once green, this is ready to merge.

@heyong4725 heyong4725 merged commit 9f4242b into dora-rs:main May 26, 2026
12 checks passed
heyong4725 added a commit that referenced this pull request May 26, 2026
… (#1946)

`examples/rust-dataflow-git/dataflow.yml` was pinned to commit
`10cf7fe9c082caaa90679bcca48c873cdc16311b` (2026-04-28), 74 commits
behind main, including major message-format and node-to-daemon
protocol changes. The example's nodes — built from the pinned rev —
no longer understood the protocol the current dora-cli speaks, so
`cargo run --example rust-dataflow-git` hung indefinitely.

This is exactly what the dataflow.yml comment warns about:

> Smoke-tests git-sourced nodes. Pins to a dora commit (not a
> released tag) so the example runs against matching message-format
> versions without needing a release. Update `rev:` when a
> message-format-breaking change lands on main — otherwise the CI
> job catches the mismatch and signals a compatibility break, which
> is the whole point of this test.

The job has done its job. Bumping the pin to current main (the
post-#1884 commit `9f4242b69720dd6c4da44260cc4102541dcf7d70`)
restores 1:1 protocol parity between the example's nodes and the
CLI they're invoked from.
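
A check for this kind of stale pin could be sketched as follows. This is a hypothetical helper, not something in the dora repository: it assumes only that the pin appears on a line of the form rev: <commit hash>, as the dataflow.yml comment suggests, and it scans the file as plain text rather than parsing the YAML.

```python
import re

# Assumption: a pin is written as "rev: <7 to 40 hex characters>", optionally
# quoted and optionally followed by a trailing comment.
REV_LINE = re.compile(r"^\s*(?:-\s*)?rev:\s*[\"']?([0-9a-fA-F]{7,40})[\"']?\s*(?:#.*)?$")


def pinned_revs(yaml_text):
    """Return (line number, rev) for every rev: pin found in the text."""
    pins = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        match = REV_LINE.match(line)
        if match:
            pins.append((number, match.group(1).lower()))
    return pins


def stale_pins(yaml_text, expected_rev):
    """Return the pins that do not point at `expected_rev` (abbreviated hashes allowed)."""
    expected = expected_rev.lower()
    return [(n, rev) for n, rev in pinned_revs(yaml_text) if not expected.startswith(rev)]
```

Run against the example's dataflow.yml with the current main commit as expected_rev, an empty result would mean every pin already matches.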

## Symptoms before this fix

`ci/circleci: examples-windows` failed on 9 consecutive main commits
since 2026-05-26 05:16 UTC. `examples-linux` failed on most of them.
Both jobs timed out at the same step ("Rust Git Dataflow example",
`no_output_timeout: 30m`):

  [success ]  19.7m  Build examples + CLI binary
  [success ]  10.4m  Rust Dataflow example
  [timedout]  28.9m  Rust Git Dataflow example   <-- here

## Local validation

  pkill -9 dora; rm -rf examples/rust-dataflow-git/git
  cargo run --example rust-dataflow-git

  Before bump: nodes built from the stale rev hang during dataflow
               start (no STOP-message handshake compatibility).
  After bump:  nodes build from current main source tree, dataflow
               runs cleanly.

Full local execution takes 20+ min (build of the pinned dora source
tree + arrow + zenoh deps), so the conclusive smoke runs in CI on
the resulting workflow.

## Follow-up

The medium-term recurrence prevention discussed in #1945 (add the
example to required checks, scheduled auto-bump, or contributor
docs) is out of scope here. This PR is the short-term unblock.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>